
    Quasi-supervised learning for biomedical data analysis

    We present a novel formulation for pattern recognition in biomedical data. We adopt a binary recognition scenario where a control dataset contains samples of one class only, while a mixed dataset contains an unlabeled collection of samples from both classes. The mixed dataset samples that belong to the second class are identified by estimating the posterior probability of each sample belonging to the control or the mixed dataset. Experiments on synthetic data showed better detection performance than alternative methods. The suitability of the method for biomedical data analysis was further demonstrated on real multi-color flow cytometry and multi-channel electroencephalography data. © 2010 Elsevier Ltd. All rights reserved.
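
The control/mixed setup described in this abstract can be illustrated with a rough sketch. The code below is not the paper's algorithm: it substitutes a plain k-nearest-neighbour estimate of the posterior probability that a mixed sample belongs to the control dataset. The function name `control_posterior`, the neighbourhood size `k`, the 0.1 flagging cutoff, and the synthetic data are all assumptions made for illustration.

```python
import numpy as np

def control_posterior(control, mixed, k=15):
    """Estimate P(control | x) for each mixed sample with a k-NN vote.

    Simplified stand-in: the fraction of the k nearest pooled samples
    that come from the control dataset.
    """
    pooled = np.vstack([control, mixed])
    is_control = np.arange(len(pooled)) < len(control)
    post = np.empty(len(mixed))
    for i, x in enumerate(mixed):
        d = np.linalg.norm(pooled - x, axis=1)
        d[len(control) + i] = np.inf      # exclude the sample itself
        nn = np.argpartition(d, k)[:k]    # indices of the k nearest samples
        post[i] = is_control[nn].mean()
    return post

# Synthetic data (assumed): the control set holds class 1 only,
# the mixed set holds class 1 followed by class 2.
rng = np.random.default_rng(0)
control = rng.normal(0.0, 1.0, (300, 2))
mixed = np.vstack([rng.normal(0.0, 1.0, (150, 2)),
                   rng.normal(6.0, 1.0, (150, 2))])

post = control_posterior(control, mixed)
flagged = post < 0.1   # mixed samples unlikely to belong to the control class
```

Mixed samples with a low estimated control posterior are the candidates for the second class. The cutoff would need tuning on real data.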

    Automatic identification of highly conserved family regions and relationships in genome wide datasets including remote protein sequences

    Identifying shared sequence segments along amino acid sequences generally requires a collection of closely related proteins, most often curated manually from the sequence datasets to suit the purpose at hand. Current statistical methods are strained, however, when the collection contains remote sequences with poor alignment to the rest, or sequences containing multiple domains. In this paper, we propose a completely unsupervised and automated method to identify the shared sequence segments observed in a diverse collection of protein sequences, including segments present in only a small fraction of the sequences, using a combination of sequence alignment, residue conservation scoring and graph-theoretical approaches. Since shared sequence fragments often imply conserved functional or structural attributes, the method produces a table of associations between the sequences and the identified conserved regions that can reveal previously unknown protein families as well as new members of existing ones. We evaluated the biological relevance of the method by clustering the proteins in gold standard datasets and comparing the clustering performance with previous methods from the literature. We then applied the proposed method to a genome-wide dataset of 17793 human proteins and generated a global association map to each of the 4753 identified conserved regions. Investigation of the major conserved regions revealed that they corresponded strongly to annotated structural domains. This suggests that the method can be useful in predicting novel domains on protein sequences.

    Annealing-based model-free expectation maximisation for multi-colour flow cytometry data clustering

    This paper proposes an optimised model-free expectation maximisation method for automated clustering of high-dimensional datasets. The method is based on a recursive binary division strategy that successively divides an original dataset into distinct clusters. Each binary division is carried out using a model-free expectation maximisation scheme that exploits the posterior probability computation capability of the quasi-supervised learning algorithm subjected to a line-search optimisation over the reference set size parameter analogous to a simulated annealing approach. The divisions are continued until a division cost exceeds an adaptively determined limit. Experiment results on synthetic as well as real multi-colour flow cytometry datasets showed that the proposed method can accurately capture the prominent clusters without requiring any prior knowledge on the number of clusters or their distribution models.

    Automated recognition of cell phenotypes in histology images based on membrane- and nuclei-targeting biomarkers

    Background: Three-dimensional in vitro cultures of cancer cells are used to predict the effects of prospective anti-cancer drugs in vivo. In this study, we present an automated image analysis protocol for detailed morphological protein marker profiling of tumoroid cross section images. Methods: Histologic cross sections of breast tumoroids developed in co-culture suspensions of breast cancer cell lines, stained for E-cadherin (Ecad) and progesterone receptor (PR), were digitized, and pixels in these images were classified into five categories using k-means clustering. Automated segmentation was used to identify image regions composed of cells expressing a given biomarker. Synthesized images were created to check the accuracy of the image processing system. Results: Accuracy of automated segmentation was over 95% in identifying regions of interest in synthesized images. Image analysis of adjacent histology slides stained, respectively, for Ecad and PR accurately predicted regions of different cell phenotypes. Image analysis of cross sections from different tumoroids obtained under the same co-culture conditions indicated the variation of cellular composition from one tumoroid to another. Variations in the compositions of cross sections obtained from the same tumoroid were established by parallel analysis of Ecad- and PR-stained cross section images. Conclusion: The proposed image analysis methods offer standardized high-throughput profiling of the molecular anatomy of tumoroids based on both membrane and nuclei markers, which is suitable for rapid large-scale investigations of anti-cancer compounds for drug development.

    Fisher’s Linear Discriminant Analysis Based Prediction using Transient Features of Seismic Events in Coal Mines

    2016 Federated Conference on Computer Science and Information Systems, FedCSIS 2016; Gdansk; Poland; 11 September 2016 through 14 September 2016
    Identification of seismic activity levels in coal mines is important to avoid accidents such as rockbursts. Creating an early warning system that can save lives requires an automated prediction method. This study proposes a prediction algorithm for the AAIA'16 Data Mining Challenge: Predicting Dangerous Seismic Events in Active Coal Mines that is based on transient activity features along with average indicators evaluated by Fisher's linear discriminant analysis. Performance evaluation experiments on the training datasets revealed an accuracy level of around 0.9438, while the performance on the test dataset was at a level of 0.9297. These results suggest that the proposed approach achieves high accuracy in predicting dangerous seismic events while maintaining low complexity.
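
For readers unfamiliar with Fisher's linear discriminant analysis, a minimal two-class sketch follows. It is a generic textbook formulation on synthetic data, not the authors' challenge pipeline. The function names, the small ridge term, the midpoint threshold, and the toy features are assumptions made for illustration.

```python
import numpy as np

def fisher_lda_fit(X, y):
    """Return (w, threshold) for a two-class Fisher discriminant."""
    X0, X1 = X[y == 0], X[y == 1]
    m0, m1 = X0.mean(axis=0), X1.mean(axis=0)
    # Within-class scatter matrix: sum of the per-class scatter matrices.
    Sw = (X0 - m0).T @ (X0 - m0) + (X1 - m1).T @ (X1 - m1)
    # Small ridge term (an assumption) keeps Sw safely invertible.
    Sw += 1e-6 * np.eye(X.shape[1])
    # Fisher direction: w is proportional to Sw^{-1} (m1 - m0).
    w = np.linalg.solve(Sw, m1 - m0)
    # Threshold at the midpoint of the projected class means.
    threshold = 0.5 * (w @ m0 + w @ m1)
    return w, threshold

def fisher_lda_predict(X, w, threshold):
    """Assign class 1 to samples whose projection exceeds the threshold."""
    return (X @ w > threshold).astype(int)

# Toy data (assumed): two Gaussian classes with three features each.
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0.0, 1.0, (200, 3)),
               rng.normal(3.0, 1.0, (200, 3))])
y = np.array([0] * 200 + [1] * 200)

w, t = fisher_lda_fit(X, y)
accuracy = (fisher_lda_predict(X, w, t) == y).mean()
```

The classifier reduces to one dot product and one comparison per sample, which is consistent with the low complexity the abstract emphasises.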

    Hierarchical motif vectors for prediction of functional sites in amino acid sequences using quasi-supervised learning

    We propose hierarchical motif vectors to represent local amino acid sequence configurations for predicting the functional attributes of amino acid sites on a global scale in a quasi-supervised learning framework. The motif vectors are constructed via wavelet decomposition of the variations of physico-chemical amino acid properties along the sequences. We then formulate a prediction scheme for the functional attributes of amino acid sites in terms of the respective motif vectors using the quasi-supervised learning algorithm, which carries out predictions for all sites under consideration using only the experimentally verified sites. We carried out a comparative performance evaluation of the proposed method on the prediction of N-glycosylation of 55,184 sites possessing the consensus N-glycosylation sequon identified over 15,104 human proteins, out of which only 1,939 were experimentally verified N-glycosylation sites. In the experiments, the proposed method achieved better predictive performance than the alternative strategies from the literature. In addition, the predicted N-glycosylation sites showed good agreement with existing potential annotations, while the novel predictions belonged to proteins known to be modified by glycosylation.
    Funding: European Commission, PIRG03-GA-2008-23090

    An efficient algorithm for large-scale quasi-supervised learning

    We present a novel formulation for quasi-supervised learning that extends the learning paradigm to large datasets. Quasi-supervised learning computes the posterior probabilities of overlapping datasets at each sample and labels those that are highly specific to their respective datasets. The proposed formulation partitions the data into sample groups to compute the dataset posterior probabilities at a lower computational complexity. In experiments on synthetic as well as real datasets, the proposed algorithm attained a significant reduction in computation time at similar recognition performance compared to the original algorithm, effectively generalizing the quasi-supervised learning paradigm to applications characterized by very large datasets.
    Funding: European Union (PIRG03-GA-2008-230903)

    A computational analysis of Turkish makam music based on a probabilistic characterization of segmented phrases

    This study targets automatic analysis of Turkish makam music pieces on the phrase level. While makam is most simply defined as an organization of melodic phrases, there has been very little effort to computationally study melodic structure in makam music pieces. In this work, we propose an automatic analysis algorithm that takes as input symbolic data in the form of machine-readable scores that are segmented into phrases. Using a measure of makam membership for phrases, our method outputs for each phrase the most likely makam the phrase comes from. The proposed makam membership definition is based on Bayesian classification, and the algorithm is specifically designed to process data with overlapping classes. The proposed analysis system is trained and tested on a large data set of phrases obtained by transferring phrase boundaries, manually written by experts of makam music on printed scores, to machine-readable data. For the task of classifying all phrases, or only the beginning phrases, as coming from the main makam of the piece, the corresponding F-measures are 0.52 and 0.60, respectively.
    Funding: Scientific and Technological Research Council of Turkey, TUBITAK (112E162)

    2-D thresholding of the connectivity map following the multiple sequence alignments of diverse datasets

    10th IASTED International Conference on Biomedical Engineering, BioMed 2013; Innsbruck; Austria; 13 February 2013 through 15 February 2013
    Multiple sequence alignment (MSA) is a widely used method to uncover the relationships between biomolecular sequences. One essential prerequisite for applying this procedure is a considerable amount of similarity between the test sequences. It is usually not possible to obtain reliable results from the multiple alignments of large and diverse datasets. Here we propose a method to obtain sequence clusters of significant intragroup similarity and to make sense of multiple alignments containing remote sequences. This is achieved by thresholding the pairwise connectivity map over two parameters. The first is the inferred pairwise evolutionary distance and the second is the number of gapless positions in the pairwise comparisons of the alignment. Threshold curves are generated from the statistical parameter values obtained from a shuffled dataset, and probability distribution techniques are employed to select an optimum threshold curve that eliminates as many of the unreliable connectivities as possible while keeping the reliable ones. We applied the method to a large and diverse dataset composed of nearly 18000 human proteins and measured the biological relevance of the recovered connectivities. Our precision measure (0.981) was nearly 20% higher than that of the connectivities left after a classical thresholding procedure, a significant improvement. Finally, we employed the method for the functional clustering of protein sequences in a gold standard dataset and measured its performance, obtaining a higher F-measure (0.882) than a conventional clustering operation (0.827).

    Novel techniques for model-free and fast computation of mutual information

    26th IEEE Signal Processing and Communications Applications Conference, SIU 2018; Altin Yunus Resort ve Thermal Hotel, Izmir; Turkey; 2 May 2018 through 5 May 2018
    In this study, two new approaches are proposed to calculate the mutual information between two random variables from data. These approaches are designed to exploit the properties of differential entropy under linear transformations and to minimize the conditional entropy in a model-free manner. In comparisons with a widely used mutual information estimator, the Kraskov method, the methods that we term the unit vector parametrization and data fitting based estimators offered an advantage in computation speed as the number of samples increases.